It is our distinct pleasure to comment on a very thought-provoking paper, and we first 
congratulate the Authors for this new masterly contribution in the field of objective 
priors. 

The main goal of the paper is to find a multi-purpose objective prior for a model 
that should be used by different researchers with varying goals, with the consequence 
that no single parameter or parametric function can be identified as a parameter of 
interest. In this situation, the most popular approaches either fail or, as in the case of 
the reference prior algorithm, cannot be used. 

Three general methods are discussed by the Authors. The first one is limited to a 
number of particular situations where the reference prior is the same for all quantities 
of interest: this case is not of much concern since a natural solution exists. The second 
method is based on the reference prior approach: one looks for the prior which produces 
the marginal posteriors for the quantities of interest that are closest, in some sense, 
to the marginal reference posteriors. Although this method is perfectly reasonable, the 
final result will depend on the particular set of quantities of interest considered, so 
it cannot be regarded as the “overall” objective prior. The third method is based on 
a hierarchical representation of the model, when one is available. It shifts the problem of 
determining an objective prior to an upper level of the hierarchy, where the impact of 
the prior might be less serious. 

We believe that the latter method is superior to the others because 

• it is compatible with a predictive approach where all the parameters are nuisance 
parameters and there is no particular quantity of interest; however, one should 
be careful here: if the quantity of interest is, for example, the posterior predictive 
mean 

\[ E(X_{n+1} \mid X_1, \ldots, X_n) \]

of a future observation, and not the entire predictive density, then a parameter 
of interest actually does exist! 

• it is clearly superior to Method 2, especially when the model is used repeatedly 
by different people who are interested in different sets of parameters. 

In terms of prediction, it would be worth discussing the proposal of Datta et al. 
(2000). 


Main article DOI: 10.1214/14-BA915. 

In this contribution, we will briefly consider the multinomial example, and provide 
some comments on the concept of prior averaging. 


1 The multinomial model in the sparse case 

This is a very interesting problem. Jeffreys’ prior allocates a weight of 1/2 to each 
original component of the vector $(\theta_1, \theta_2, \ldots, \theta_m)$. This is too much when $m$ is large 
compared to the sample size $n$ and the distribution is very sparse. This suggests that 
the prior mass should be adequately spread on the parameter space in such a way that 
each cell has a negligible prior mean, especially when compared with the weight of the 
data. 

In the multinomial case, the prior weight (expressed as the sum of the hyperparameters 
of the Dirichlet prior) is equal to $m/2$ for the Jeffreys’ prior, while in the 
hierarchical approach, arising from a Dirichlet($a, a, \ldots, a$) hyper-prior, it is a random 
quantity $v = ma$ with density given by expression (25) of the paper, at least in the 
case of an infinite $m$. Several numerical computations, with different values of $n$ and 
$r_0$ (i.e., the number of non-empty cells), show that the mode and the median of $v$ are 
rarely larger than 2, so the hierarchical approach automatically accounts for the sparsity 
and the corresponding marginal posteriors are dramatically different from those arising 
from the use of Jeffreys’ prior. 

There are many ways in which this problem can be handled. If we transform it to a 
multiple testing problem, that is, for each cell $i$ we test 

\[ H_0: \theta_i = 0 \quad \text{vs.} \quad H_1: \theta_i \neq 0, \]

the problem can be rephrased as that of finding an ad-hoc prior, just like in the sparse 
normal problem, which is well studied in the literature; see, for example, Scott and Berger 
(2010). The two problems are similar but not identical: here we do not necessarily 
observe data for each cell, and the difficulties associated with this discrete version of 
the problem are even greater since the values of the $\theta_i$’s will affect the standard deviation 
of the cells, not only the means. 

From a testing perspective there is also another interesting connection: the Authors 
propose to add - as a prior weight - something close to 1/m to each cell. So the 
total weight of the prior will be approximately one. This reminds us of the unit prior 
information of Kass and Wasserman (1996). 

The sparse multinomial case is also of theoretical interest because it represents a 
bridge between parametric and non-parametric models, when the number of cells goes 
to infinity. 

Our personal view of the example is close to that of the Authors, although it is not 
of great surprise that the Jeffreys’ prior does not clearly discriminate between observed 
and non-observed cells, when $n$ is so small compared to $m$. In other words, this is too 
much to ask of the prior. When $n$ is as small as 3, and the number of parameters is 
about 1000, it is hopeless to find a good automatic objective prior and some external 
guidance (in this case, the choice of a “proper” prior within the Dirichlet class) seems 
unavoidable. 
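
The size of the effect can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions: it uses a fixed symmetric Dirichlet prior with weight $1/m$ per cell as a stand-in for the hierarchical prior (it does not implement expression (25) of the paper), the data are a hypothetical sample with $n = 3$ distinct observed cells out of $m = 1000$, and the function name `observed_mass` is ours.

```python
from fractions import Fraction


def observed_mass(counts, a):
    """Total posterior mean probability of the observed (non-empty) cells
    under a symmetric Dirichlet(a, ..., a) prior.

    The posterior is Dirichlet(x_i + a), so E[theta_i | x] = (x_i + a) / (n + m*a);
    summing over the r0 non-empty cells gives (n + r0*a) / (n + m*a).
    """
    m = len(counts)
    n = sum(counts)
    r0 = sum(1 for x in counts if x > 0)
    return (n + r0 * a) / (n + m * a)


# Hypothetical sparse data: three cells observed once each, 997 empty cells.
m = 1000
counts = [1, 1, 1] + [0] * (m - 3)

jeffreys = observed_mass(counts, Fraction(1, 2))   # weight 1/2 per cell
sparse = observed_mass(counts, Fraction(1, m))     # weight 1/m per cell (assumed stand-in)

print("Jeffreys, weight 1/2 per cell:", float(jeffreys))
print("weight 1/m per cell:          ", float(sparse))
```

With weight 1/2 per cell, the total prior weight of 500 swamps the three observations and less than 1% of the posterior mean mass falls on the cells that were actually observed; with total prior weight one, about three quarters of it does.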

More interesting is the fact that the hierarchical prior depends on m and n only 
through their ratio: this is actually what one would expect. 

We have also considered a variant of the multinomial example, in which the multinomial 
likelihood is rephrased as one arising from a sample of $m$ independent Poisson random 
variables with mean vector $(\psi_1, \ldots, \psi_m)$, setting 
$\theta_j = \psi_j / \sum_{k=1}^{m} \psi_k$. Doing the usual reference prior calculations here, we 
ended up with the same conclusions as if we had used the standard Jeffreys’ 
Beta(1/2, 1/2) prior for the $\theta_j$’s. We wonder how to get the same result (weights $\approx m^{-1}$ for 
the cells) in this alternative perspective. It is very likely that this can be obtained by 
assuming independent gamma priors with shape parameter $a$ and scale parameter $\beta$ for 
the $\psi_j$’s. If the “nuisance” scale parameter $\beta$ is eliminated by conditioning on the total 
counts, we end up with the same conclusion. However, the rationale behind this last 
choice is, again, only pragmatic. 
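
The gamma construction rests on a standard fact: independent gamma variables with common shape and scale, divided by their sum, follow a symmetric Dirichlet distribution whatever the scale. A minimal simulation sketch of that fact follows; the values of `m`, `a` and the two scales are arbitrary illustrative choices, and the helper names are ours.

```python
import random


def normalized_gammas(m, shape, scale, rng):
    """Draw m independent Gamma(shape, scale) variables and divide by their sum."""
    psi = [rng.gammavariate(shape, scale) for _ in range(m)]
    total = sum(psi)
    return [p / total for p in psi]


def first_cell_moments(m, shape, scale, draws=20000, seed=0):
    """Monte Carlo mean and variance of the first normalized component."""
    rng = random.Random(seed)
    xs = [normalized_gammas(m, shape, scale, rng)[0] for _ in range(draws)]
    mean = sum(xs) / draws
    var = sum((x - mean) ** 2 for x in xs) / draws
    return mean, var


m, a = 5, 0.5
# Moments of one component of a symmetric Dirichlet(a, ..., a) in m cells.
dirichlet_mean = 1 / m
dirichlet_var = (m - 1) / (m ** 2 * (m * a + 1))

for scale in (1.0, 10.0):
    mean, var = first_cell_moments(m, a, scale)
    print(f"scale={scale}: mean={mean:.4f} (theory {dirichlet_mean:.4f}), "
          f"var={var:.4f} (theory {dirichlet_var:.4f})")
```

Both scales reproduce the Dirichlet moments up to Monte Carlo error, which is the sense in which the scale parameter is a nuisance that drops out after normalization.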

A related issue is the ordered multinomial example in Section 2.1.2. Here the overall 
prior for any of the parameters $\xi_i$ is the product of independent Beta(1/2, 1/2) densities: 
what happens for large $m$? Is the overall prior still a sensible prior or should we take 
into account this problem? 

2 A comment on geometric average of priors 

Consider the following divergence function 

\[ d(\eta) = \sum_{i=1}^{m} \alpha_i \int \eta(\theta) \log \frac{\eta(\theta)}{\pi_i(\theta)} \, d\theta, \]

where $\alpha_1, \ldots, \alpha_m > 0$ are suitable constants adding to 1, and $\pi_i(\theta)$ may be a suitable 
objective prior when one is interested in one of a given set of $m$ parametric functions. 
The above function is a weighted average Kullback-Leibler divergence between a global 
prior $\eta$ and the marginal priors we would like to use in the case we were interested in a 
single parametric function $t_i(\theta)$, $i = 1, \ldots, m$. Note that 
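
\[ d(\eta) = \int \eta(\theta) \log \frac{\eta(\theta)}{\prod_{i=1}^{m} \pi_i^{\alpha_i}(\theta)} \, d\theta. \]

By Jensen’s inequality, $d(\eta)$ will be minimized with respect to $\eta$ if 
$\eta(\theta) / \prod_{i=1}^{m} \pi_i^{\alpha_i}(\theta)$ is a degenerate function. This leads to the geometric mean prior 

\[ \pi_G(\theta) \propto \prod_{i=1}^{m} \pi_i^{\alpha_i}(\theta). \]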
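
The minimizing property can be checked numerically on a grid. The sketch below is an illustration only: it replaces the (usually improper) component priors with two proper discretized stand-ins on (0, 1), a Beta(1/2, 1/2)-shaped prior and a uniform prior, takes equal weights, and compares the weighted Kullback-Leibler divergence of the normalized geometric mean with that of the arithmetic mean. All names and numerical choices are ours.

```python
import math


def normalize(weights):
    """Rescale positive weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def weighted_kl(eta, priors, alphas):
    """d(eta) = sum_i alpha_i * KL(eta || pi_i) on a common finite grid."""
    return sum(
        a * sum(e * math.log(e / p) for e, p in zip(eta, pi) if e > 0)
        for a, pi in zip(alphas, priors)
    )


# Midpoint grid on (0, 1) and two discretized component priors (assumed examples).
K = 200
grid = [(k + 0.5) / K for k in range(K)]
pi1 = normalize([(t * (1 - t)) ** -0.5 for t in grid])  # Beta(1/2, 1/2) shape
pi2 = normalize([1.0 for _ in grid])                    # uniform
priors, alphas = [pi1, pi2], [0.5, 0.5]

# Geometric mean prior: product of pi_i^alpha_i, then normalized by Z.
unnormalized = [
    math.exp(sum(a * math.log(p[k]) for p, a in zip(priors, alphas)))
    for k in range(K)
]
Z = sum(unnormalized)
geometric = [u / Z for u in unnormalized]

# Arithmetic mean prior with the same weights.
arithmetic = [sum(a * p[k] for p, a in zip(priors, alphas)) for k in range(K)]

d_geo = weighted_kl(geometric, priors, alphas)
d_ari = weighted_kl(arithmetic, priors, alphas)
print("d(geometric) :", d_geo)
print("-log Z       :", -math.log(Z))
print("d(arithmetic):", d_ari)
```

In this proper, discretized setting the divergence at the geometric mean equals $-\log Z$, where $Z$ is the normalizing constant of the product, and every other candidate, including the arithmetic mean and each component prior, gives a larger value.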
Usually, the component priors $\pi_i(\theta)$ are improper, which in turn may also make 
$\pi_G(\theta)$ an improper prior. The Authors indicated that the geometric mean prior is preferable 
to the arithmetic mean prior since one or more of the component priors may be 
improper, and the arithmetic mean posterior may be highly influenced by one or a few 
component posteriors. Indeed, for any arbitrary positive constant $c_i$, $c_i \pi_i(\theta)$ is as much 
an objective prior as $\pi_i(\theta)$ is. While the posterior propriety of the arithmetic mean prior 
is an immediate consequence of the propriety of the component posteriors, the same is 
not so obvious for the geometric mean prior. However, the following lemma shows that 
the posterior corresponding to $\pi_G(\theta)$ will be proper provided that each component prior 
$\pi_i(\theta)$ generates a proper posterior. 

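Lemma 1. For two prior densities $g(\theta)$ and $\nu(\theta)$, if 

\[ \int g(\theta) L(\theta; x) \, d\theta < \infty \quad \text{and} \quad \int \nu(\theta) L(\theta; x) \, d\theta < \infty, \]

then, for any $\alpha \in (0, 1)$, 

\[ \int g^{\alpha}(\theta) \, \nu^{1-\alpha}(\theta) L(\theta; x) \, d\theta < \infty, \]

where $L(\theta; x)$ denotes the joint density of data $x$ corresponding to the parameter value $\theta$. 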
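Proof. By Hölder’s inequality, it follows that 

\[
\begin{aligned}
\int g^{\alpha}(\theta) \, \nu^{1-\alpha}(\theta) L(\theta; x) \, d\theta
  &= \int \left[ g(\theta) L(\theta; x) \right]^{\alpha} \left[ \nu(\theta) L(\theta; x) \right]^{1-\alpha} d\theta \\
  &\le \left[ \int g(\theta) L(\theta; x) \, d\theta \right]^{\alpha} \left[ \int \nu(\theta) L(\theta; x) \, d\theta \right]^{1-\alpha}.
\end{aligned}
\]

Thus $g^{\alpha} \nu^{1-\alpha}$ generates a proper posterior density for the given data $x$. □ 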


By repeated use of this lemma, the propriety of the posterior based on the geometric 
prior $\pi_G(\theta)$ easily follows. 
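
As a quick numerical sanity check of Lemma 1, consider a hypothetical binomial example that is not taken from the paper: $x = 3$ successes in $n = 10$ trials, with the two improper priors $g(\theta) = 1/\{\theta(1-\theta)\}$ and $\nu(\theta) = 1/\theta$, both of which give proper posteriors for these data. The sketch approximates the three integrals with a midpoint rule; the choice $\alpha = 0.4$ and all names are ours.

```python
def integrate(f, K=2000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    return sum(f((k + 0.5) / K) for k in range(K)) / K


x, n = 3, 10      # hypothetical data: 3 successes in 10 trials
alpha = 0.4       # any value in (0, 1)


def likelihood(t):
    """Binomial likelihood kernel L(theta; x)."""
    return t ** x * (1 - t) ** (n - x)


def g(t):
    """Improper prior 1 / (theta * (1 - theta))."""
    return 1.0 / (t * (1 - t))


def nu(t):
    """Improper prior 1 / theta."""
    return 1.0 / t


int_g = integrate(lambda t: g(t) * likelihood(t))    # exact value B(3, 7) = 1/252
int_nu = integrate(lambda t: nu(t) * likelihood(t))  # exact value B(3, 8) = 1/360
lhs = integrate(lambda t: g(t) ** alpha * nu(t) ** (1 - alpha) * likelihood(t))
rhs = int_g ** alpha * int_nu ** (1 - alpha)

print("integral with geometric mean prior:", lhs)
print("Hoelder bound:                     ", rhs)
```

The integral under the geometric mean prior stays below the product of powers of the two component integrals, so it is finite whenever they are.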


3 An anecdote 

While preparing the present comments, one of the authors attended a seminar on applied 
probability where the following situation was presented. In a small village, there is a 
chief and several shepherds. Each shepherd tends a flock of sheep. The chief knows that 
the ground of their village is going to become parched, so the shepherds have to move 
away. All the roads starting from the village, except one, are full of hungry wolves. 
The chief has his/her own probability distribution about which is the safe road. If the chief 
communicates his/her information to the shepherds, it is very likely that all of them 
would choose the same road. This implies that either all the sheep or none will survive. 
If the chief does not communicate his/her information, it is likely that the shepherds 
will choose a road at random. 
The question is: should the chief share this information with the shepherds or not? 
If so, (s)he is playing a risky (all or nothing) strategy. If not, (s)he is taking a minimax 
strategy where it is more likely that some of the flocks will survive. Is there a way to 
calibrate the amount of information to be shared? 

There are several interesting similarities between this story and the main issue of 
the paper. Is there a way to find a compromise between the general goal and a single 
objective? Is it possible to find a prior - or a strategy - which is not so bad for any of 
the problems at hand? 

Our view is that, if the answer is “yes”, this prior should not depend on the particular 
list of problems. In other words, it would be great to have just “one” overall prior. In 
this respect, the hierarchical approach seems to be more promising. 